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Abstract 

We present techniques for effective Gaussian process (GP) modelling of multiple short time 
series. These problems are common when applying GP models independently to each gene in 
a gene expression time series data set. Such sets typically contain very few time points. Naive 
application of common GP modelling techniques can lead to severe over-fitting or under-fitting 
in a significant fraction of the fitted models, depending on the details of the data set. We propose 
avoiding over-fitting by constraining the GP length-scale to values that focus most of the energy 
spectrum to frequencies below the Nyquist frequency corresponding to the sampling frequency 
in the data set. Under-fitting can be avoided by more informative priors on observation noise. 
Combining these methods allows applying GP methods reliably automatically to large numbers 
of independent instances of short time series. This is illustrated with experiments with both 
synthetic data and real gene expression data. 



1 Introduction 



Gaussian processes (GPs) are a widely applied non-parametric probabilistic model for continuous 
data (Rasmussen and Williams, 2006). Because of their non-parametric nature, they can flexibly 



adapt to differently sized data sets and can easily accommodate for example non-uniformly sampled 
data. GPs are computationally very convenient, because they permit exact marginalisation of the 
latent process in regression with a Gaussian likelihood. 

Most methods development work on GPs in machine learning has focused on developing effi- 
cient inference for large data sets (see, e.g. Quihonero Candela and Rasmussen, 2005 Snelson and 



Ghahramani, 2006 Rasmussen and Williams 



2006 



Titsias 2009). This is an important area, as 



naive inference algorithms suffer from cubic computational complexity with respect to the data set 
size, and the recently developed methods can successfully reduce this significantly. 

In this paper we focus instead on the other frontier of GP applications in data sets with a very 
large number of small independent instances. GPs for such applications have recently gathered 
significant interest in computational systems biology, where they provide a very powerful model for 



sparsely and often irregularly sampled gene expression time series (jLawrence et al. 



2008; Kirk and Stumpf, 2009; Stegle et al. , 20101 Liu et al. 
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(a) The data 



(b) The log-likelihood surface 



Figure 1: (a) The underlying function (blue line) for the synthetic example and an example data 
set instance (red dots), (b) Contour plot of the log-likelihood surface for different I and <r^ values 
corresponding to the data set in (a) showing the maximum likelihood solution (black dot) with a 
very small a\ ~ 0.005 and I ~ 0.9, when no bounds are introduced. 



independent models is important in many applications of these models, such as ranking of targets 
of gene regulators (Honkela et al. 2010). 

Most gene expression time series are very short with a great majority having less than 9 time 
points (Ernst et al. 2005), so computational complexity of any GP inference method will typically 
not be an issue. Instead, the application of GP methods in these problems will face other problems 
due to lack of and sparseness of data. Depending on the specific problem, this can easily lead to either 
over-fitting or under-fitting. When fitting the models automatically to a large number, possibly 
several thousands, of instances, it is impractical to manually locate and fix these problematic fits. 
In this paper we present methods for setting constraints or more restrictive priors to some model 
parameters that help avoid these phenomena. Earlier heuristic variants of the length-scale bound 
have been applied in some previous works (Honkela et al. 2011 Titsias et al. , 2012 ) without detailed 
justification, but here we present a new rigorous derivation for the bound. 



2 GP modelling methods for small data sets 



A GP is a stochastic process {/(x)|x G X} for which the marginal distribution at any finite sub- 
collection of points xi,...,x n is multivariate Gaussian (Rasmussen and Williams |2006 ). The 
process is completely defined by the mean function /x(x) and the covariance function /c(x, x'), that 
also define the mean vector and covariance matrix of the multivariate Gaussian over the sub- 
collection. For simplicity, we assume the mean function is identically zero /u(x) = 0. 

The most widely used covariance function for GPs in machine learning is the squared exponential 



covariance (Rasmussen and Williams, 2006) 



/c S e(x,x) = ajexp 



where r = ||x 
length-scale L 



The covariance depends on two positive hyperparameters: variance a'j 



(1) 



and 



The squared exponential covariance is infinitely smooth, which is often too strong 
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Figure 2: A GP model fitted to 7 data points generated from f(x) = sm ^ x > showing over-fitting to 
the data. The plot shows the posterior mean of the GP (black solid line) together with two standard 
deviation posterior credible regions both for the squared exponential covariance (dashed line) and 
squared exponential plus noise covariance (dotted line). The two posterior credible regions are very 
close to each other, indicating a very small estimated observation noise level. 



an assumption. A simple generalisation is given by the Matern class covariance functions 



/CMatern(x, X ) = CT /f ^y I — — I K v I — — I , (2) 



with additional positive hyperparameter u, where K v is a modified Bessel function (Rasmussen and 



Williams} |2006[ ). 

Both these covariance functions have a length-scale parameter £ that governs the range of de- 
pendencies in the process. A short length scale corresponds to rapidly varying functions with weak 
long-range dependencies, while a large length-scale corresponds to slowly varying functions. Ex- 
tremely small length-scale may lead to a situation where each observation is treated as essentially 
independent, which makes the model over- fit. 



2.1 Illustrative Example 

We illustrate fitting GPs to small data sets with synthetic data generated from a sinc(x) function, 
that is, f(x) = sm ( x ^ ; by uniformly sampling 7 data points in the interval [-5,6] with a noise term 
which is normally distributed with mean and variance 0.09. An example of such a data set is 
shown in Fig. [TJa). 

Fitting a GP model with squared exponential covariance to the data set shown in Fig. [ija) and 
selecting the maximum likelihood (ML) solution for the squared exponential covariance variance a'j 
leads to a likelihood surface for length scale I and observation noise variance a\ shown in Fig. [TJb) . 
The ML estimate for the noise is clearly much smaller than the generative value indicating severe 
over-fitting to the data. This corresponds to a fairly small value for the length-scale compared to 
the sampling rate in the data. The GP model corresponding to the ML fit is shown in Fig. [2] 



2.2 Length-scale bounds 

As seen above, naive application of common GP modelling techniques can lead to severe over- fitting 
or under-fitting, depending on the details of the data set. We propose avoiding over- fitting by 
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constraining the GP length-scale t to values that focus most of the energy spectrum to frequencies 
below the Nyquist frequency corresponding to the sampling in the data set. According to the 
Nyquist sampling theorem, the Nyquist frequency f n = is the maximal frequency that can be 



identified in the spectral representation of the sampled signal (Tick and Shaman, 1966, Oppenheim 



and Schafer 1975). Here At is the sampling interval in the data set. In case of non-uniformly 



sampled data, we define At conservatively as the shortest distance between consecutive data points 
to obtain the least restrictive bound. 

In case of the squared exponential covariance function, the spectral density is given by 



Sse(s) = (2x0 



2\D/2 



exp{-2n 2 £ 2 s 2 ) 



(3) 



where D is the number of dimensions and s denotes the frequency ( Rasmussen and Williams , 2006| ) . 
For D = 1, the corresponding lower bound for the length-scale can be found by solving the inequality 



i 

2At 

I SSE{S) dS = 6rf ( vfe) 

i 

" 2 At 



> a, 



(4) 



where a denotes the fraction of the system's energy on the frequencies that are below the Nyquist 
frequency. 

Solving for I and setting a to 0.99, we can obtain the lower bound for the length scale parameter 
that would constrain at least 99% of the process's energy on the frequencies which are below the 
Nyquist frequency: 

I > a e (a) = ^ erfinv ( a ) At ra 0.8199 x At, (5) 



7T 



where erfinv(x) denotes the inverse of the error function. 

For the 1-dimensional Matern covariance function the corresponding spectral density is 



S'Matern(s) 



' T + 4xV 



from which we can derive the fraction of energy in the frequency interval [—1/2 At, 1/2 At] as 



(6) 
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2 At 



SMa 



(s) ds 
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(7) 



where 2 -^i(o, 6, c, x) is the hypergeometric function (Abramowitz and Stegun, 1965). Given a fixed 
value of v, appropriate lower bound ai for t can be derived from Eq. ([7]) using numerical optimisation. 

For Bayesian parameter estimation, a uniform prior over the length scale over the interval 
[a£,t n — t\] seems like a reasonable objective prior. 



2.3 Variance prior 

Bounding the length-scale as above can avoid most obvious over-fitting, but naive estimation of 
observation noise will still frequently lead to mis-estimation of the noise and hence effectively over- 
fitting or under-fitting the data. The easiest way to avoid this is to use informative priors on the 
noise that focus the distribution away from implausibly small and large values. 
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Figure 3: Fraction of over-fitted models in different setups with synthetic data. Setting the length- 
scale bound (a) automatically eliminates all problems with length scales and a noise bound (b) with 
the noise values, and the corresponding curves are overlapping at constant 0. 
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Figure 4: Fraction of models with low predictive log-likelihood values (a) and with high MSE values 
(b) in different setups with synthetic data. 



For gene expression data, one can for example use posterior variances of the inferred expres- 



sion levels from pre-processing both for microarrays (Liu et al. 2005; Du et al. 2008) and RNA 



sequencing (Turro et al. 



in (Lawrence et al. 2007 



As an alternative, Coo' 



2011 Glaus et al. 2012). This kind of approach has been applied e.g. 



Gao et al., 2008 Honkela et al., 2010 Titsias et al., 2012) 



ce et al. (2011 ) use an empirical variance estimate from multiple replicates 



as the approximate lower bound and total variance as the approximate upper bound for a semi- 
empirical prior on the variance. 



3 Experiments 

We present experimental results highlighting the over-fitting caused by bad length-scale estimates 
on synthetic data and real gene expression data. 
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= 0.2203, of = 0.0551 
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(a) No bounds 



(b) Bounded I 




or = 0.1, qf = 0.1869 
I - 1.5032 

log-likelihood - -5.5123 




(c) Bounded 



(d) Bounded £ and bounded a\ 



Figure 5: Examples of GP model fits for a synthetic data instance. Using no bounds in (a) leads 
to severe over- fitting which is compensated by under- fitting by introducing the I bound ([a^,oo) = 
[f.5032,oo), from At 1.8334) in (b), while introducing only the bound ([0.01,0.1]) does not 
help avoid over-fitting in (c). Introducing the i and a\ bounds together in (d) leads to a reasonably 
good fit to the original function. 



3.1 Synthetic data 

We generated synthetic data using a procedure similar to the one described in Sec. |2.1[ We created 
1000 independent instances of the data set with different noise realisations. We repeated this 
sampling n = 5, 7, . . . , 15 equally spaced points on the interval [—5, 6]. 

For each of these data sets, we fitted the model in four different scenarios: 

1. Unconstrained I parameter estimation and noise estimation; 

2. Unconstrained a\ estimation and bounded i in the range [o£,oo); 

3. Bounded cr^ in the range [0.01,0.1] and unconstrained £; and 

4. Bounded o\ in the range [0.01,0.1] and bounded I in the range [ae,oo). 

For each n and each scenario, we recorded the number of fits with i < ag and o\ < 0.0001, both 
of which are indications of over- fitting. The results are shown in Fig. [3} For very small data sets, 
practically all instances lead to a short length scale, and a significant fraction has a very small 
estimated noise variance. For larger data sets the fractions drop, but remain well above zero. 
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Alternatively, to compare how the fitted GP models coincide with the true underlying function, 
we calculated the mean squared errors (MSE) by using 10 test points, equally spaced in the range [- 
6,5] . MSE values can be found simply by calculating the mean of the squared differences between the 
true underlying function values and the predicted function values at the test points. Additionally, 
we also computed the predictive log- likelihood values for the true function values at the test points. 
Small MSE values, and large log-likelihood values can be considered as the indicators of a good fit, 
whereas high MSE values and low log-likelihood values indicate the opposite. 

The frequencies of the instances with very low predictive log-likelihood values (smaller than 
—20) and with very high MSE values (larger than 0.1) can be seen in Fig. ga) and Fig. Qb), 
respectively. It is clear that the frequencies are very high if no bounds are set to the parameters. 
Once the parameter bounds are introduced, the frequencies start to decrease, with the largest 
decrease occuring in the instances with small sample sizes. Furthermore, for each instance, we 
recorded in which setup the fitted GP model leads to the smallest MSE and the largest predictive 
log-likelihood values. In Table [T] and Table [2j the fractions of the largest predictive log-likelihood 
values and the smallest MSE values in four different setups are presented, supporting the fact 
that using the length scale and noise variance bounds together improves the model fit drastically 
especially in the instances with small sample sizes. 

Model fits for an example realisation, with sample size 7, suffering from over-fitting with un- 
bounded estimation are shown in Fig. [5j The over-fitting in the unbounded estimate in Fig. [5ja) 
is clearly remedied by introducing the length-scale bound in Fig. [5^b) , but this makes the model 
under-fit. Also constraining the noise variance to sensible values in Fig. [5^d) leads to a reasonably 
good fit considering the amount of data. 



Table 1: Fractions (%) of the largest log- likelihoods in different setups with synthetic data (n 
denotes the sample size) 





n = 5 


n = 7 


n = 9 


n = 11 


n = 13 


n = 15 


NO BOUNDS 


2.2 


3.0 


9.6 


10.6 


11.9 


11.6 


£ BOUNDED 


1.0 


8.5 


11.6 


14.9 


13.7 


13.0 


of> BOUNDED 


28.3 


34.2 


44.1 


37.6 


39.7 


34.6 


£ AND (X* BOUNDED 


68.5 


54.3 


34.7 


36.9 


34.7 


40.8 



Table 2: Fractions (%) of the smallest MSEs in different settings with synthetic data (n denotes 
the sample size) 





n = 5 


n = 7 


n = 9 


n = 11 


n = 13 


n = 15 


NO BOUNDS 


8.1 


9.4 


13.7 


19.1 


22.3 


23.6 


£ BOUNDED 


5.7 


14.3 


20.0 


22.6 


25.0 


26.0 


a\ BOUNDED 


27.0 


36.2 


38.8 


28.9 


26.5 


21.5 


£ AND ofj BOUNDED 


59.2 


40.1 


27.5 


29.4 


26.2 


28.9 



3.2 Gene expression data 

We apply the model to fruit fly Drosophila melanogaster developmental gene expression time 



series from (Tomancak et al. 2002) using the differential-equation-based gene regulation model 



from (Lawrence et al., 2007 Gao et al., 2008 Honkela et al., 2010). 
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Table 3: Proportion of over-fitted single-target cascaded differential equation models according to 
different criteria in different experimental setups. '.' indicates a situation that is not possible 
because of the bounds. 





BIN 


MEF2 


TIN 


TWI 




I < a e 


a'i < 0.01 


I < ai 


a'i < 0.01 


£ < a e 


a'i < 0.01 


£ < a e 


a'i < 0.01 


NO BOUNDS 


0.0% 


0.0% 


0.0% 


0.3% 


13.9% 


2.9% 


0.0% 


1.6% 


£ BOUNDED 




0.0% 




0.3% 




4.1% 




1.6% 


a'i FIXED 


1.8% 




1.9% 




5.3% 




7.1% 




£ BOUNDED, a'i FIXED 



















The experiments were run using a modified version of the tigre Bioconductor package ( Honkela 



et al. 2011). For each model, we tested 4 different scenarios: 

1. Unconstrained I parameter estimation and noise a'i estimation; 

2. Unconstrained ai estimation and bounded £; 

3. Fixed ai from pre-processing and unconstrained £; and 

4. Fixed ai and bounded I. 



For the last two, the noise variances were obtained from data pre-processing as in (Honkela et al. 



2010). The models were run for 6795 genes passing the significant activity filter described in (Honkela 
et al.l 120101). 



Single-target cascaded differential equation models As the first example, we tested single- 
target cascaded differential equation models linking observed regulator TF mRNA to candidate 
target mRNA. We ran the model for 4 TFs: BIN, MEF2, TIN and TWI. Fractions of genes with 
t < ae or ai < 0.01 for each TF for each setting are listed in Table [3| 

The results show a moderate number of genes that exhibit symptoms of over-fitting. The 
numbers vary significantly for different TFs, as the strength of the driving signal is different. The 
expression input for TIN is especially weak and sharply peaked, which leads to relatively larger 
number of over-fitted models without the precautions proposed in this paper. An example of such 
a model is illustrated in Fig. [6} 



Multiple-target models As the second example we consider a slightly more difficult task of 
fitting a single- layer differential equation model with no TF mRNA input (Gao et al. 2008). We 
fit the model to each gene in turn with two fixed known targets, MEF2 and TIN. These are both 
known targets of TWI (Zinzen et al. 2009), so these models could plausibly be used to discover 
further targets of TWI. Each model was fitted independently without using any information from 
the previous models. 

Fractions of genes with different symptoms of over- fitting are listed in Table |4} The numbers are 
in most cases slightly higher than in Table [3j demonstrating that this is a more challenging task. 



4 Discussion 

We have presented methods for improving large-scale learning of GP models for very many instances 
of small data sets. We presented a novel rigorous derivation for a bound for sensible length scales 
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Figure 6: Example of a single-target cascaded ODE gene regulation model. Using no bounds in (a) 
leads to severe over- fitting which is mostly corrected by I bound in (b). More accurate noise model 
in (c) further improves the accuracy slightly. 



Table 4: Proportion of over-fitted multiple-target non-cascaded differential equation models accord- 
ing to different criteria in different experimental setups. 





£ < a e 


a'i < 0.01 


NO BOUNDS 


7.3% 


8.3% 


i BOUNDED 




6.7% 


<T 2 n FIXED 


6.7% 




I BOUNDED, a\ FIXED 







for squared exponential and Matern covariance functions. The bound is based on constraining the 
energy spectrum of the GP covariance to frequencies that can plausibly be reconstructed from the 
data. This can be intuitively justified by the fact that the data was collected at such sampling rate 
in the first place, which encodes a prior assumption about the time scale of interest. This bound 
clearly helps avoid many cases of obvious over-fitting, as illustrated with both synthetic and real 
data experiments. 

The usual underlying reason for the small length-scale fits is often that a smooth model may 
not be a very good fit to the specific instance of data. The GP model may attempt to compensate 
for this by using more functional degrees of freedom than seems plausible and essentially modelling 
each single observation independently. For highly constrained models such as the linear ODE 
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models in Sec. |3.2| the results are usually not very severe, but for more flexible models such as ones 
incorporating gene-dependent delays the over-fitting can be even more severe problem. 

One may wonder whether the observed small length-scales are caused by using point estimates 
for the parameters. This does not appear to be the illustrated by the likelihood surface 

in Fig. [l|b) which shows a very smooth maximum in small length-scale regime. This is further 
supported by the fact that some form of length-scale bounds was needed by Titsias et al. (2012) for 
a method completely based on MCMC. 

We believe the presented length-scale bound is an important ingredient for large-scale learning 
of multiple independent GP models for short time series that are very common in biology. Carefully 
justified priors will be especially important in the future when moving beyond the simple linear ODE 
models of Honkela et al. (2010) to more realistic and flexible models. Using such models without 
suitable constraints or suitably constrained priors can easily lead to unexpected over-fitting failure 
modes. 
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